10 research outputs found

    Automatic processing of code-mixed social media content

    Get PDF
    Code-mixing or language-mixing is a linguistic phenomenon where multiple language mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) tagger and parsers perform poorly because such tools are generally trained with monolingual content. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a world-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learn- ing techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Fields (CRF) and a recurrent neural network in the form of a Long Short Term Memory (LSTM) that consider words as well as characters, LSTM outperformed the other methods. We also took part in the First Workshop of Computational Approaches to Code-Switching, EMNLP, 2014 where we achieved the highest token-level accuracy in the word-level language identification task of Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code- mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) have been implemented, among them, neural approach outperformed the other approach. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare between a factorial CRF (FCRF) based joint model and three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language identification and POS tagging by outperforming the FCRF approach. Further- more, we found that it is better to go for a multi-task learning approach than to perform individual task (e.g. language identification and POS tagging) using neural approach. Comparison between the three neural approaches revealed that without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of output layers for these two tasks e.g. LID and POS tagging

    NextGen AML: distributed deep learning based language technologies to augment anti money laundering Investigation

    Get PDF
    Most of the current anti money laundering (AML) systems, using handcrafted rules, are heavily reliant on existing structured databases, which are not capable of effectively and efficiently identifying hidden and complex ML activities, especially those with dynamic and timevarying characteristics, resulting in a high percentage of false positives. Therefore, analysts1 are engaged for further investigation which significantly increases human capital cost and processing time. To alleviate these issues, this paper presents a novel framework for the next generation AML by applying and visualizing deep learning-driven natural language processing (NLP) technologies in a distributed and scalable manner to augment AML monitoring and investigation. The proposed distributed framework performs news and tweet sentiment analysis, entity recognition, relation extraction, entity linking and link analysis on different data sources (e.g. news articles and tweets) to provide additional evidence to human investigators for final decisionmaking. Each NLP module is evaluated on a task-specific data set, and the overall experiments are performed on synthetic and real-world datasets. Feedback from AML practitioners suggests that our system can reduce approximately 30% time and cost compared to their previous manual approaches of AML investigation

    DCU: aspect-based polarity classification for SemEval task 4

    Get PDF
    We describe the work carried out by DCU on the Aspect Based Sentiment Analysis task at SemEval 2014. Our team submitted one constrained run for the restaurant domain and one for the laptop domain for sub-task B (aspect term polarity prediction), ranking highest out of 36 systems on the restaurant test set and joint highest out of 32 systems on the laptop test set

    Automatic processing of code-mixed social media content

    No full text
    Code-mixing or language-mixing is a linguistic phenomenon where multiple language mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) tagger and parsers perform poorly because such tools are generally trained with monolingual content. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a world-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learn- ing techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Fields (CRF) and a recurrent neural network in the form of a Long Short Term Memory (LSTM) that consider words as well as characters, LSTM outperformed the other methods. We also took part in the First Workshop of Computational Approaches to Code-Switching, EMNLP, 2014 where we achieved the highest token-level accuracy in the word-level language identification task of Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code- mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) have been implemented, among them, neural approach outperformed the other approach. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare between a factorial CRF (FCRF) based joint model and three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language identification and POS tagging by outperforming the FCRF approach. Further- more, we found that it is better to go for a multi-task learning approach than to perform individual task (e.g. language identification and POS tagging) using neural approach. Comparison between the three neural approaches revealed that without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of output layers for these two tasks e.g. LID and POS tagging

    DCU-UVT.: Word-Level Language Classification with Code-Mixed Data

    Get PDF
    This paper describes the DCU-UVT team’s participation in the Language Identification in Code-Switched Data shared task in the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based on these experiments, we select our SVM-based system with contextual clues as our final system and present results for the Nepali-English and Spanish-English datasets

    Suomi-skenaariot - työkalu strategiseen ajatteluun

    Get PDF
    In social media communication, multilin-gual speakers often switch between lan-guages, and, in such an environment, au-tomatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of cre-ating, which contains Facebook posts and comments that exhibit code mixing be-tween Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsuper-vised dictionary-based approach, super-vised word-level classification with and without contextual clues, and sequence la-belling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is im-portant to take contextual clues into con-sideration.

    NextGen AML: distributed deep learning based language technologies to augment anti money laundering Investigation

    No full text
    Most of the current anti money laundering (AML) systems, using handcrafted rules, are heavily reliant on existing structured databases, which are not capable of effectively and efficiently identifying hidden and complex ML activities, especially those with dynamic and timevarying characteristics, resulting in a high percentage of false positives. Therefore, analysts1 are engaged for further investigation which significantly increases human capital cost and processing time. To alleviate these issues, this paper presents a novel framework for the next generation AML by applying and visualizing deep learning-driven natural language processing (NLP) technologies in a distributed and scalable manner to augment AML monitoring and investigation. The proposed distributed framework performs news and tweet sentiment analysis, entity recognition, relation extraction, entity linking and link analysis on different data sources (e.g. news articles and tweets) to provide additional evidence to human investigators for final decisionmaking. Each NLP module is evaluated on a task-specific data set, and the overall experiments are performed on synthetic and real-world datasets. Feedback from AML practitioners suggests that our system can reduce approximately 30% time and cost compared to their previous manual approaches of AML investigation

    Natural Language Processing for Social Media, Second Edition

    No full text
    corecore